library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(plotly)
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout

Question 1:

set.seed(1)

a)

Ydata <- runif(1000)
df <- tibble(y=Ydata)

for (i in 1:100){
  df[[paste0('x',i)]] <- runif(1000)
}
lm_fit <- lm(y ~ .,data = df)
smry <- summary(lm_fit)
i_count <- 0

for (i in 1:101){
  if (smry$coefficients[,4][i] < 0.05){
    i_count = i_count +1
  }
}
(i_count)
## [1] 6

From the above results, several predictors have marginal p-values below \(0.05\) even though the response was generated independently of every predictor. With `set.seed(1)` the count is reproducible at \(6\) (including the intercept); without a fixed seed it would vary from run to run.
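The counting loop can also be written as a single vectorized comparison. This sketch regenerates the same data (the `matrix` call draws the same random stream as the column-by-column loop above) and drops the intercept row, which is not of interest when counting spuriously significant predictors:

```r
set.seed(1)
# y first, then 100 columns of noise predictors (matrix fills column-wise,
# reproducing the same draws as the x1..x100 loop above)
df <- data.frame(y = runif(1000), matrix(runif(1000 * 100), ncol = 100))
smry <- summary(lm(y ~ ., data = df))

# p-values are in column 4 of the coefficient matrix; [-1, ] drops the intercept
sum(smry$coefficients[-1, 4] < 0.05)
```

With this seed, 5 of the 100 pure-noise predictors are marginally significant; adding the intercept (p = 0.000295) gives the count of 6 reported above.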

f <- smry$fstatistic
(smry)
## 
## Call:
## lm(formula = y ~ ., data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6043 -0.2257 -0.0055  0.2321  0.6903 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.903e-01  1.624e-01   3.634 0.000295 ***
## x1          -3.421e-02  3.272e-02  -1.046 0.296042    
## x2          -3.559e-02  3.346e-02  -1.064 0.287722    
## x3           2.026e-02  3.192e-02   0.635 0.525739    
## x4           3.439e-02  3.264e-02   1.054 0.292339    
## x5           8.913e-03  3.451e-02   0.258 0.796229    
## x6          -2.417e-03  3.328e-02  -0.073 0.942125    
## x7          -2.892e-03  3.300e-02  -0.088 0.930175    
## x8          -5.956e-03  3.364e-02  -0.177 0.859503    
## x9          -6.429e-03  3.221e-02  -0.200 0.841867    
## x10         -8.359e-02  3.356e-02  -2.491 0.012931 *  
## x11         -2.250e-02  3.359e-02  -0.670 0.503127    
## x12          3.285e-02  3.264e-02   1.007 0.314343    
## x13         -4.862e-02  3.306e-02  -1.470 0.141788    
## x14          1.617e-02  3.407e-02   0.475 0.635140    
## x15          3.012e-02  3.287e-02   0.916 0.359790    
## x16          1.070e-02  3.353e-02   0.319 0.749609    
## x17         -1.057e-03  3.331e-02  -0.032 0.974697    
## x18          4.892e-02  3.346e-02   1.462 0.144109    
## x19         -1.775e-02  3.338e-02  -0.532 0.595063    
## x20          5.202e-03  3.371e-02   0.154 0.877391    
## x21         -2.509e-02  3.360e-02  -0.747 0.455389    
## x22          1.128e-02  3.247e-02   0.347 0.728327    
## x23          1.241e-02  3.252e-02   0.382 0.702889    
## x24          5.281e-03  3.403e-02   0.155 0.876726    
## x25         -7.305e-02  3.365e-02  -2.171 0.030220 *  
## x26         -5.417e-02  3.408e-02  -1.589 0.112305    
## x27         -4.579e-02  3.371e-02  -1.359 0.174623    
## x28         -4.443e-02  3.312e-02  -1.341 0.180147    
## x29          1.407e-02  3.388e-02   0.415 0.677925    
## x30         -9.658e-03  3.305e-02  -0.292 0.770140    
## x31          7.948e-03  3.271e-02   0.243 0.808062    
## x32         -1.185e-02  3.329e-02  -0.356 0.721916    
## x33         -2.701e-02  3.381e-02  -0.799 0.424638    
## x34          2.995e-02  3.291e-02   0.910 0.363060    
## x35         -8.298e-03  3.394e-02  -0.244 0.806913    
## x36         -3.681e-02  3.385e-02  -1.087 0.277172    
## x37          2.545e-02  3.432e-02   0.742 0.458559    
## x38          4.045e-02  3.401e-02   1.189 0.234641    
## x39         -1.459e-02  3.320e-02  -0.439 0.660435    
## x40          1.811e-02  3.355e-02   0.540 0.589509    
## x41          3.742e-02  3.270e-02   1.145 0.252677    
## x42          4.224e-02  3.344e-02   1.263 0.206867    
## x43          1.692e-02  3.362e-02   0.503 0.614843    
## x44         -3.306e-02  3.387e-02  -0.976 0.329308    
## x45         -4.702e-02  3.351e-02  -1.403 0.160906    
## x46         -4.877e-02  3.242e-02  -1.504 0.132870    
## x47          2.837e-03  3.289e-02   0.086 0.931287    
## x48          1.652e-02  3.373e-02   0.490 0.624510    
## x49          1.244e-02  3.372e-02   0.369 0.712346    
## x50          2.164e-02  3.307e-02   0.654 0.513049    
## x51         -6.217e-02  3.259e-02  -1.908 0.056758 .  
## x52         -2.343e-02  3.317e-02  -0.707 0.480031    
## x53         -4.156e-02  3.379e-02  -1.230 0.219001    
## x54         -9.515e-03  3.255e-02  -0.292 0.770111    
## x55          3.855e-02  3.326e-02   1.159 0.246745    
## x56          9.661e-03  3.284e-02   0.294 0.768696    
## x57         -3.555e-02  3.241e-02  -1.097 0.273089    
## x58          3.764e-02  3.194e-02   1.179 0.238907    
## x59         -3.186e-02  3.271e-02  -0.974 0.330411    
## x60         -6.571e-03  3.423e-02  -0.192 0.847795    
## x61         -2.071e-02  3.296e-02  -0.628 0.529965    
## x62         -4.984e-03  3.404e-02  -0.146 0.883643    
## x63          5.211e-02  3.294e-02   1.582 0.113972    
## x64          1.063e-02  3.251e-02   0.327 0.743698    
## x65         -2.326e-02  3.303e-02  -0.704 0.481373    
## x66         -1.772e-02  3.257e-02  -0.544 0.586491    
## x67         -1.758e-03  3.329e-02  -0.053 0.957903    
## x68         -7.429e-03  3.389e-02  -0.219 0.826521    
## x69          5.449e-02  3.317e-02   1.643 0.100738    
## x70         -2.354e-02  3.324e-02  -0.708 0.478981    
## x71          2.314e-02  3.300e-02   0.701 0.483478    
## x72          1.995e-02  3.314e-02   0.602 0.547460    
## x73          3.226e-02  3.281e-02   0.983 0.325817    
## x74          3.779e-02  3.341e-02   1.131 0.258356    
## x75          4.316e-02  3.284e-02   1.314 0.189110    
## x76          2.805e-02  3.290e-02   0.853 0.393981    
## x77          3.638e-02  3.305e-02   1.101 0.271291    
## x78         -4.685e-04  3.281e-02  -0.014 0.988611    
## x79         -7.308e-03  3.313e-02  -0.221 0.825489    
## x80          6.049e-04  3.315e-02   0.018 0.985444    
## x81          1.248e-02  3.315e-02   0.376 0.706730    
## x82         -6.622e-02  3.339e-02  -1.983 0.047648 *  
## x83          7.077e-02  3.319e-02   2.133 0.033227 *  
## x84          5.766e-02  3.282e-02   1.757 0.079290 .  
## x85          2.976e-03  3.269e-02   0.091 0.927473    
## x86         -1.527e-02  3.256e-02  -0.469 0.639202    
## x87          1.617e-03  3.394e-02   0.048 0.962015    
## x88          3.140e-02  3.202e-02   0.981 0.327044    
## x89         -2.132e-02  3.314e-02  -0.643 0.520165    
## x90          1.877e-02  3.322e-02   0.565 0.572103    
## x91         -1.183e-01  3.338e-02  -3.545 0.000413 ***
## x92          9.479e-05  3.379e-02   0.003 0.997762    
## x93          3.530e-02  3.375e-02   1.046 0.295902    
## x94          2.124e-02  3.331e-02   0.638 0.523883    
## x95         -5.604e-02  3.299e-02  -1.698 0.089772 .  
## x96         -1.605e-02  3.299e-02  -0.487 0.626715    
## x97         -3.374e-02  3.235e-02  -1.043 0.297275    
## x98          9.989e-03  3.252e-02   0.307 0.758765    
## x99          3.839e-02  3.422e-02   1.122 0.262228    
## x100        -5.713e-02  3.322e-02  -1.719 0.085875 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2884 on 899 degrees of freedom
## Multiple R-squared:  0.1002, Adjusted R-squared:  0.0001042 
## F-statistic: 1.001 on 100 and 899 DF,  p-value: 0.4814
(f_pval <- pf(f[1], f[2], f[3], lower.tail = FALSE))
##    value 
## 0.481358

`pf` as used above takes a quantile and the two degrees-of-freedom parameters of the F distribution. Here `f[1]` is the observed F-statistic, `f[2]` is the numerator degrees of freedom (the number of predictors, 100), and `f[3]` is the denominator (residual) degrees of freedom (899). With `lower.tail = FALSE`, `pf` returns the upper-tail probability of the F distribution, which is exactly the p-value of the overall F-test.
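As a check, the same p-value can be recomputed by hand from the rounded values printed in the summary:

```r
# F = 1.001 on 100 (numerator) and 899 (denominator) degrees of freedom,
# as reported in the summary output above
pf(1.001, df1 = 100, df2 = 899, lower.tail = FALSE)
```

This returns approximately 0.481, matching the summary's reported p-value of 0.4814 up to the rounding of the F-statistic.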

b)

results <- tibble(total_count = numeric(1000), p_value = numeric(1000))
N <- 1000

for (ij in 1:N){
  
  Ydata <- runif(1000)
  df <- tibble(y=Ydata)
  
  for (i in 1:100){
    df[[paste0('x',i)]] <- runif(1000)
  }
  
  # fitting the linear model
  lm_fit <- lm(y ~ .,data = df)
  smry <- summary(lm_fit)
  
  
  # count
  i_count <- 0

  for (i in 1:101){
    if (smry$coefficients[,4][i] < 0.05){
      i_count = i_count +1
    }
  }
  
  
  # p_value
  f <- smry$fstatistic
  f_pval <- pf(f[1], f[2], f[3], lower.tail = FALSE)
  
  
  results$total_count[ij] <- i_count
  results$p_value[ij] <- f_pval
  
}
cnt <- 
  results %>%
  filter(total_count >= 2) %>%
  summarise(n=n())

(cnt)/N
##       n
## 1 0.979
  1. Clearly, the proportion of simulations in which at least one predictor's marginal t-test p-value is below \(0.05\) is \(0.979\) according to the above results. (The filter `total_count >= 2` counts two or more significant rows because the intercept, whose true value is \(0.5\), is essentially always significant.)
cnt <- 
  results %>%
  filter(p_value < 0.05) %>%
  summarise(n=n())

(cnt)/N
##       n
## 1 0.047
  1. Clearly, the proportion of simulations in which the F-test p-value is below \(0.05\) is \(0.047\) according to the above results, close to the nominal \(5\%\) level expected when the null model is true.

Question 2:

library(ggplot2)
## Boston Data
library(ISLR2)
coeff_data <- tibble(SLR = numeric(12), MLR = numeric(12))
(lm_summary <- summary(lm(crim ~ zn, data = Boston)))
## 
## Call:
## lm(formula = crim ~ zn, data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.429 -4.222 -2.620  1.250 84.523 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.45369    0.41722  10.675  < 2e-16 ***
## zn          -0.07393    0.01609  -4.594 5.51e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.435 on 504 degrees of freedom
## Multiple R-squared:  0.04019,    Adjusted R-squared:  0.03828 
## F-statistic:  21.1 on 1 and 504 DF,  p-value: 5.506e-06
coeff_data$SLR[1] <- (lm_summary$coefficients[,1][2])
ggplot(Boston, aes(x = zn, y = crim)) +
  geom_point() +  # scatter plot of data points
  geom_smooth(method = "lm", color = 'red') 
## `geom_smooth()` using formula = 'y ~ x'

From the above, the per capita crime rate by town is not well explained by zn, the proportion of residential land zoned for lots over \(25,000\) sq.ft. This is reflected in the low \(R^2\) and adjusted \(R^2\). The p-value for the zn coefficient is well below \(0.05\), so the slope is statistically significant, but the graph above shows only a weak negative relationship.

(lm_summary <- summary(lm(crim ~ indus, data = Boston)))
## 
## Call:
## lm(formula = crim ~ indus, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.972  -2.698  -0.736   0.712  81.813 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.06374    0.66723  -3.093  0.00209 ** 
## indus        0.50978    0.05102   9.991  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.866 on 504 degrees of freedom
## Multiple R-squared:  0.1653, Adjusted R-squared:  0.1637 
## F-statistic: 99.82 on 1 and 504 DF,  p-value: < 2.2e-16
coeff_data$SLR[2] <- (lm_summary$coefficients[,1][2])
ggplot(Boston, aes(x = indus, y = crim)) +
  geom_point() +  
  geom_smooth(method = "lm", color = 'red') 
## `geom_smooth()` using formula = 'y ~ x'

Similarly, the per capita crime rate by town is only partially explained by indus, the proportion of non-retail business acres per town. The \(R^2\) and adjusted \(R^2\) are somewhat higher here, indicating that indus accounts for some of the variation in crim, though not much. The coefficient p-values are well below \(0.05\) and thus statistically significant, and the graph shows a modest positive relationship.

(lm_summary <- summary(lm(crim ~ chas, data = Boston)))
## 
## Call:
## lm(formula = crim ~ chas, data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.738 -3.661 -3.435  0.018 85.232 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.7444     0.3961   9.453   <2e-16 ***
## chas         -1.8928     1.5061  -1.257    0.209    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.597 on 504 degrees of freedom
## Multiple R-squared:  0.003124,   Adjusted R-squared:  0.001146 
## F-statistic: 1.579 on 1 and 504 DF,  p-value: 0.2094
coeff_data$SLR[3] <- (lm_summary$coefficients[,1][2])
ggplot(Boston, aes(x = chas, y = crim)) +
  geom_point() +  
  geom_smooth(method = "lm", color = 'red') 
## `geom_smooth()` using formula = 'y ~ x'

Also from the above, the per capita crime rate by town is poorly explained by chas, the Charles River dummy variable (= 1 if tract bounds river; 0 otherwise). Because chas is binary, the fitted line simply connects the mean crim of the two groups, and the graph of the fit makes clear how little of the variation it captures. The coefficient is not significant (p = 0.209), so there is no evidence of a relationship between chas and crim.

(lm_summary <- summary(lm(crim ~ nox, data = Boston)))
## 
## Call:
## lm(formula = crim ~ nox, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.371  -2.738  -0.974   0.559  81.728 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -13.720      1.699  -8.073 5.08e-15 ***
## nox           31.249      2.999  10.419  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.81 on 504 degrees of freedom
## Multiple R-squared:  0.1772, Adjusted R-squared:  0.1756 
## F-statistic: 108.6 on 1 and 504 DF,  p-value: < 2.2e-16
coeff_data$SLR[4] <- (lm_summary$coefficients[,1][2])
ggplot(Boston, aes(x = nox, y = crim)) +
  geom_point() +  # scatter plot of data points
  geom_smooth(method = "lm", color = 'red') 
## `geom_smooth()` using formula = 'y ~ x'

The coefficient p-values here are very small and highly significant, and nox, the nitrogen oxides concentration (parts per 10 million), explains a moderate share of the variation in crim. Both the fitted line and the \(R^2\) and adjusted \(R^2\) values show a clear positive relationship between the two variables.

(lm_summary <- summary(lm(crim ~ rm, data = Boston)))
## 
## Call:
## lm(formula = crim ~ rm, data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.604 -3.952 -2.654  0.989 87.197 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   20.482      3.365   6.088 2.27e-09 ***
## rm            -2.684      0.532  -5.045 6.35e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.401 on 504 degrees of freedom
## Multiple R-squared:  0.04807,    Adjusted R-squared:  0.04618 
## F-statistic: 25.45 on 1 and 504 DF,  p-value: 6.347e-07
coeff_data$SLR[5] <- (lm_summary$coefficients[,1][2])
ggplot(Boston, aes(x = rm, y = crim)) +
  geom_point() +  # scatter plot of data points
  geom_smooth(method = "lm", color = 'red') 
## `geom_smooth()` using formula = 'y ~ x'

Comparing the per capita crime rate by town to rm, the average number of rooms per dwelling, the data are concentrated near the center of the rm range and the fitted line shows a negative relationship. The slope is statistically significant (very small p-value), but the low \(R^2\) indicates that little of the variation in crim is explained by variation in rm.

(lm_summary <- summary(lm(crim ~ age, data = Boston)))
## 
## Call:
## lm(formula = crim ~ age, data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.789 -4.257 -1.230  1.527 82.849 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.77791    0.94398  -4.002 7.22e-05 ***
## age          0.10779    0.01274   8.463 2.85e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.057 on 504 degrees of freedom
## Multiple R-squared:  0.1244, Adjusted R-squared:  0.1227 
## F-statistic: 71.62 on 1 and 504 DF,  p-value: 2.855e-16
coeff_data$SLR[6] <- (lm_summary$coefficients[,1][2])
ggplot(Boston, aes(x = age, y = crim)) +
  geom_point() +  
  geom_smooth(method = "lm", color = 'red') 
## `geom_smooth()` using formula = 'y ~ x'

The coefficient p-values are very small and significant, and the \(R^2\) and adjusted \(R^2\) indicate that age, the proportion of owner-occupied units built prior to 1940, explains a modest share of the variation in crim. The graph of the fit shows a positive relationship between the two variables.

(lm_summary <- summary(lm(crim ~ dis, data = Boston)))
## 
## Call:
## lm(formula = crim ~ dis, data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.708 -4.134 -1.527  1.516 81.674 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.4993     0.7304  13.006   <2e-16 ***
## dis          -1.5509     0.1683  -9.213   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.965 on 504 degrees of freedom
## Multiple R-squared:  0.1441, Adjusted R-squared:  0.1425 
## F-statistic: 84.89 on 1 and 504 DF,  p-value: < 2.2e-16
coeff_data$SLR[7] <- (lm_summary$coefficients[,1][2])
ggplot(Boston, aes(x = dis, y = crim)) +
  geom_point() +  
  geom_smooth(method = "lm", color = 'red') 
## `geom_smooth()` using formula = 'y ~ x'

There is a modest negative relationship between the per capita crime rate by town and dis, the weighted mean of distances to five Boston employment centres, as the fitted line indicates. The coefficient p-values are very small and significant, and the \(R^2\) and adjusted \(R^2\) show that dis explains a small but real share of the variation in crim.

(lm_summary <- summary(lm(crim ~ rad, data = Boston)))
## 
## Call:
## lm(formula = crim ~ rad, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.164  -1.381  -0.141   0.660  76.433 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.28716    0.44348  -5.157 3.61e-07 ***
## rad          0.61791    0.03433  17.998  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.718 on 504 degrees of freedom
## Multiple R-squared:  0.3913, Adjusted R-squared:   0.39 
## F-statistic: 323.9 on 1 and 504 DF,  p-value: < 2.2e-16
coeff_data$SLR[8] <- (lm_summary$coefficients[,1][2])
ggplot(Boston, aes(x = rad, y = crim)) +
  geom_point() +  
  geom_smooth(method = "lm", color = 'red') 
## `geom_smooth()` using formula = 'y ~ x'

There is a positive relationship between the per capita crime rate by town and rad, the index of accessibility to radial highways, and rad has the highest single-predictor \(R^2\) (about 0.39). However, the scatter plot shows the data clustered in vertical bands at a few discrete rad values, so predictions of crim from this model alone should be treated with caution.

(lm_summary <- summary(lm(crim ~ tax, data = Boston)))
## 
## Call:
## lm(formula = crim ~ tax, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.513  -2.738  -0.194   1.065  77.696 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -8.528369   0.815809  -10.45   <2e-16 ***
## tax          0.029742   0.001847   16.10   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.997 on 504 degrees of freedom
## Multiple R-squared:  0.3396, Adjusted R-squared:  0.3383 
## F-statistic: 259.2 on 1 and 504 DF,  p-value: < 2.2e-16
coeff_data$SLR[9] <- (lm_summary$coefficients[,1][2])
ggplot(Boston, aes(x = tax, y = crim)) +
  geom_point() +  
  geom_smooth(method = "lm", color = 'red') 
## `geom_smooth()` using formula = 'y ~ x'

There is a clear positive relationship between the per capita crime rate by town and tax, the full-value property-tax rate per \(\$10,000\), with highly significant coefficients and a relatively high \(R^2\) (about 0.34). Still, as with rad, the data are concentrated at a few tax values, so the model captures only part of the structure in crim.

(lm_summary <- summary(lm(crim ~ ptratio, data = Boston)))
## 
## Call:
## lm(formula = crim ~ ptratio, data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.654 -3.985 -1.912  1.825 83.353 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.6469     3.1473  -5.607 3.40e-08 ***
## ptratio       1.1520     0.1694   6.801 2.94e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.24 on 504 degrees of freedom
## Multiple R-squared:  0.08407,    Adjusted R-squared:  0.08225 
## F-statistic: 46.26 on 1 and 504 DF,  p-value: 2.943e-11
coeff_data$SLR[10] <- (lm_summary$coefficients[,1][2])
ggplot(Boston, aes(x = ptratio, y = crim)) +
  geom_point() +  
  geom_smooth(method = "lm", color = 'red') 
## `geom_smooth()` using formula = 'y ~ x'

As with tax, the fitted relationship between the per capita crime rate by town and ptratio, the pupil-teacher ratio by town, is positive and the coefficient is significant. However, much of the data is concentrated at the upper end of ptratio, and the low \(R^2\) shows that ptratio accounts for only a small part of the variation in crim.

(lm_summary <- summary(lm(crim ~ lstat, data = Boston)))
## 
## Call:
## lm(formula = crim ~ lstat, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.925  -2.822  -0.664   1.079  82.862 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.33054    0.69376  -4.801 2.09e-06 ***
## lstat        0.54880    0.04776  11.491  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.664 on 504 degrees of freedom
## Multiple R-squared:  0.2076, Adjusted R-squared:  0.206 
## F-statistic:   132 on 1 and 504 DF,  p-value: < 2.2e-16
coeff_data$SLR[11] <- (lm_summary$coefficients[,1][2])
ggplot(Boston, aes(x = lstat, y = crim)) +
  geom_point() +  
  geom_smooth(method = "lm", color = 'red') 
## `geom_smooth()` using formula = 'y ~ x'

The scatter of the per capita crime rate by town against lstat, the percentage of lower-status population, shows a fairly clear positive association. The coefficient p-values are very small and significant, and the \(R^2\) (about 0.21), while modest, supports the conclusion that some of the variation in crim is explained by variation in lstat.

(lm_summary <- summary(lm(crim ~ medv, data = Boston)))
## 
## Call:
## lm(formula = crim ~ medv, data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.071 -4.022 -2.343  1.298 80.957 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.79654    0.93419   12.63   <2e-16 ***
## medv        -0.36316    0.03839   -9.46   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.934 on 504 degrees of freedom
## Multiple R-squared:  0.1508, Adjusted R-squared:  0.1491 
## F-statistic: 89.49 on 1 and 504 DF,  p-value: < 2.2e-16
coeff_data$SLR[12] <- (lm_summary$coefficients[,1][2])
ggplot(Boston, aes(x = medv, y = crim)) +
  geom_point() +  
  geom_smooth(method = "lm", color = 'red') 
## `geom_smooth()` using formula = 'y ~ x'

Similar to the relation between crim and lstat, the relationship between the per capita crime rate by town and medv, the median value of owner-occupied homes in \(\$1000s\), is moderately strong, with an \(R^2\) of about 0.15 supporting the assertion that some variation in crim is explained by variation in medv. The main difference is that here the relationship is negative.
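The twelve simple regressions above can also be fit in one pass. This sketch (assuming ISLR2's `Boston` is available) collects each slope and its p-value into a single table:

```r
library(ISLR2)  # Boston data

predictors <- setdiff(names(Boston), "crim")
slr_results <- data.frame(predictor = predictors,
                          slope = NA_real_, p_value = NA_real_)

for (j in seq_along(predictors)) {
  # reformulate() builds crim ~ <predictor> from the name
  fit <- lm(reformulate(predictors[j], response = "crim"), data = Boston)
  cf  <- summary(fit)$coefficients
  slr_results$slope[j]   <- cf[2, 1]  # slope estimate
  slr_results$p_value[j] <- cf[2, 4]  # marginal p-value
}
slr_results
```

Every slope is significant at the \(0.05\) level except chas, matching the individual summaries above.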

b)

(fit_summary <- summary(lm(crim ~ ., data = Boston)))
## 
## Call:
## lm(formula = crim ~ ., data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.534 -2.248 -0.348  1.087 73.923 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.7783938  7.0818258   1.946 0.052271 .  
## zn           0.0457100  0.0187903   2.433 0.015344 *  
## indus       -0.0583501  0.0836351  -0.698 0.485709    
## chas        -0.8253776  1.1833963  -0.697 0.485841    
## nox         -9.9575865  5.2898242  -1.882 0.060370 .  
## rm           0.6289107  0.6070924   1.036 0.300738    
## age         -0.0008483  0.0179482  -0.047 0.962323    
## dis         -1.0122467  0.2824676  -3.584 0.000373 ***
## rad          0.6124653  0.0875358   6.997 8.59e-12 ***
## tax         -0.0037756  0.0051723  -0.730 0.465757    
## ptratio     -0.3040728  0.1863598  -1.632 0.103393    
## lstat        0.1388006  0.0757213   1.833 0.067398 .  
## medv        -0.2200564  0.0598240  -3.678 0.000261 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.46 on 493 degrees of freedom
## Multiple R-squared:  0.4493, Adjusted R-squared:  0.4359 
## F-statistic: 33.52 on 12 and 493 DF,  p-value: < 2.2e-16

From the above results, for the predictors zn, dis, rad, and medv we reject the null hypothesis \(H_0 : \beta_j = 0\), since their marginal p-values fall below \(0.05\). Moreover, the joint model has considerably higher \(R^2\) and adjusted \(R^2\) values (about 0.45 and 0.44) than any of the individual simple regressions considered above.
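The rejected hypotheses can be pulled out programmatically rather than read off the stars. A self-contained sketch (refitting the full model so it stands alone):

```r
library(ISLR2)

full_fit <- summary(lm(crim ~ ., data = Boston))
coefs <- full_fit$coefficients[-1, ]   # drop the intercept row
# predictors whose marginal p-value (column 4) falls below 0.05
rownames(coefs)[coefs[, 4] < 0.05]
```

With the ISLR2 version of Boston this returns zn, dis, rad, and medv, consistent with the summary printed above.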

c)

coeff_data$MLR <- fit_summary$coefficients[,1][2:13]
(coeff_data)
## # A tibble: 12 × 2
##        SLR       MLR
##      <dbl>     <dbl>
##  1 -0.0739  0.0457  
##  2  0.510  -0.0584  
##  3 -1.89   -0.825   
##  4 31.2    -9.96    
##  5 -2.68    0.629   
##  6  0.108  -0.000848
##  7 -1.55   -1.01    
##  8  0.618   0.612   
##  9  0.0297 -0.00378 
## 10  1.15   -0.304   
## 11  0.549   0.139   
## 12 -0.363  -0.220
plot(coeff_data$SLR, coeff_data$MLR, xlab = "Simple Linear Regression Coefficients", ylab = "Multiple Linear Regression Coefficients", main = "Plot of SLR coefficients against MLR coefficients")

The above graph plots each predictor's simple-regression coefficient against its multiple-regression coefficient. Clearly, several coefficients change substantially, some even in sign, once the other predictors are controlled for; the most dramatic case is nox, whose coefficient moves from about \(31.2\) to about \(-9.96\). This reflects confounding among the predictors, and the joint model, which explains far more of the variation in crim, gives the more reliable picture of each predictor's effect.
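Labelling the points makes clear which predictor drives the large discrepancy. A standalone sketch that rebuilds both coefficient sets and annotates the base-R plot with `text()`:

```r
library(ISLR2)

predictors <- setdiff(names(Boston), "crim")
# simple-regression slope for each predictor
slr <- sapply(predictors, function(p)
  coef(lm(reformulate(p, response = "crim"), data = Boston))[2])
# multiple-regression coefficients, intercept dropped
mlr <- coef(lm(crim ~ ., data = Boston))[-1]

plot(slr, mlr,
     xlab = "Simple Linear Regression Coefficients",
     ylab = "Multiple Linear Regression Coefficients")
text(slr, mlr, labels = predictors, pos = 3, cex = 0.7)  # name each point
```

The isolated point far to the right is nox.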


d)

polynomial_pvals <- tibble(names = rep(NA_character_, 12), pval = numeric(12))


for (i in 2:ncol(Boston)) {
  predictor <- names(Boston)[i]
  
  # Ensure that the predictor is numeric for poly() function
  if (is.numeric(Boston[[predictor]])) {
    formula <- as.formula(paste("crim ~ poly(", predictor, ", 3, raw = TRUE)"))
    lm_poly <- lm(formula, data = Boston)
    smry_poly <- summary(lm_poly)
    
    # Store predictor name and its corresponding p-value
    polynomial_pvals$names[i-1] <- predictor
    
    # Extract F-statistic and its p-value
    f <- smry_poly$fstatistic
    if (all(!is.na(f))) { 
      f_pval <- pf(f[1], f[2], f[3], lower.tail = FALSE)
      polynomial_pvals$pval[i-1] <- f_pval
    } else {
      polynomial_pvals$pval[i-1] <- NA 
    }
  } else {
    polynomial_pvals$pval[i-1] <- NA  }
}
(polynomial_pvals)
## # A tibble: 12 × 2
##    names       pval
##    <chr>      <dbl>
##  1 zn      1.28e- 6
##  2 indus   1.55e-32
##  3 chas    2.09e- 1
##  4 nox     3.81e-38
##  5 rm      1.07e- 7
##  6 age     1.02e-20
##  7 dis     3.14e-35
##  8 rad     2.31e-55
##  9 tax     7.34e-50
## 10 ptratio 4.17e-13
## 11 lstat   1.35e-26
## 12 medv    4.45e-59

After fitting a polynomial curve of degree 3 of each predictor against the response, the recorded overall F-test p-values are extremely small, and hence highly significant, for every predictor except chas, the Charles River dummy variable (= 1 if tract bounds river; 0 otherwise), which is binary. The strongly significant cubic fits suggest that most predictors are associated with crim and that some of these relationships may be non-linear; examining the quadratic and cubic coefficients individually would confirm which predictors are genuinely non-linear.
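A more direct test of non-linearity is to compare the linear fit against the cubic fit with an F-test via `anova()`. A sketch using nox as an illustrative example (the same idea applies to any numeric predictor):

```r
library(ISLR2)

fit_lin <- lm(crim ~ nox, data = Boston)            # linear only
fit_cub <- lm(crim ~ poly(nox, 3), data = Boston)   # adds quadratic and cubic terms

# Small p-value => the quadratic/cubic terms add significant explanatory power
anova(fit_lin, fit_cub)
```

For nox this comparison is highly significant, i.e. evidence of a non-linear relationship beyond the linear trend.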

Question 3:

f_2_9 <- function(x) {
  -4.5 * ((x - 10) / 50)^3 + 9 * ((x - 10) / 50)^2 + 4.2
}

x_range_tibble <- tibble(x = c(0, 100))

set.seed(1)
n_train <- 45
n_test <- 15
var_epsilon <- 2

test_data <- 
  tibble(
    x = runif(n = n_test, min = 0, max = 100),
    true_f_x = f_2_9(x),
    y = true_f_x + rnorm(n = n_test, mean = 0, sd = sqrt(var_epsilon)),
    type = "te"
  )

train_data <- 
  tibble(
    x = runif(n = n_train, min = 0, max = 100),
    true_f_x = f_2_9(x),
    y = true_f_x + rnorm(n = n_train, mean = 0, sd = sqrt(var_epsilon)),
    type = "tr"
  )

df_seq <- seq(1, 10, 1)
MSEP <- tibble(df = rep(NA, length(df_seq)), train = NA, test = NA)

for (i in 1:length(df_seq)) {
  MSEP$df[i] <- df_seq[i]
  
 
  smsp_2_9 <- lm(y ~ poly(x, df_seq[i], raw = TRUE), data = train_data)
  

  pred_train <- predict(smsp_2_9, newdata = train_data)
  
 
  MSEP$train[i] <- mean( (train_data$y - pred_train) ^ 2 )
  
 
  pred_test <- predict(smsp_2_9, newdata = test_data)

  MSEP$test[i] <- mean( (test_data$y - pred_test) ^ 2 )
} 

MSEP_long <- pivot_longer(MSEP, cols = c(train, test), names_to = "Subset", values_to = "MSEP")

g_MSEP0 <-
  ggplot(MSEP_long) +
  geom_line(aes(x = df, y = MSEP, color = Subset)) +
  theme_bw()

min_test_df <-
  MSEP %>%
  filter(test == min(test)) %>%
  pull(df)

(ng <-
  g_MSEP0 +
  xlab("Flexibility") +
  ylab("Mean Squared Error") +
  geom_point(data = MSEP_long %>% filter(df == min_test_df), aes(x = df, y = MSEP), color = "black", shape = "X", size = 3))

ggplotly(ng) 

b)

From the above plots, it is evident that the training MSEP decreases monotonically as flexibility increases, because higher-degree polynomials fit the training data ever more closely. The testing MSEP, in contrast, decreases initially, reaches a minimum at the optimal degree, and then increases due to overfitting. The best bias-variance trade-off occurs where the testing MSEP is minimized, marked on the graph; beyond that point, further increases in polynomial degree (degrees of freedom) only overfit the training data.

c)

optimal_model <- lm(y ~ poly(x, min_test_df, raw = TRUE), data = train_data)
plot_data <- tibble(x = seq(0, 100, length.out = 1000))
plot_data <- plot_data %>%
  mutate(
    true_f_x = f_2_9(x),
    predicted_f_x = predict(optimal_model, newdata = plot_data)
  )

ggplot() +
  geom_line(data = plot_data, aes(x = x, y = true_f_x, color = "True Curve"), linewidth = 1) +  
  geom_line(data = plot_data, aes(x = x, y = predicted_f_x, color = "Least MSE Polynomial"), linewidth = 1) +  
  geom_point(data = train_data, aes(x = x, y = y, color = "Training Data"), alpha = 0.5) +
  geom_point(data = test_data, aes(x = x, y = y, color = "Testing Data"), alpha = 0.5) +
  theme_bw() +
  xlab("x") +
  ylab("y") +
  ggtitle("True Function vs. Estimated Function") +
  scale_color_manual(values = c("True Curve" = "black", "Least MSE Polynomial" = "orange", "Training Data" = "red", "Testing Data" = "green")) +
  theme(legend.title = element_blank())  # Optional: removes legend title

The graph above shows the training and testing data together with the true curve given in the problem statement and the estimated polynomial regression with the lowest testing MSEP. Although the estimated polynomial does not match the true curve exactly, the two are close; the discrepancies are due to errors in estimating the model's coefficients. The estimated regression function with the lowest MSEP for the testing set is a polynomial of degree \(k = \text{min\_test\_df}\). The function can be written as:

\[ \hat{f}(x) = \hat{\beta}_0 + \hat{\beta}_1 x + \hat{\beta}_2 x^2 + \dots + \hat{\beta}_k x^k \]

where \(\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_k\) are the coefficients obtained from fitting the polynomial regression model to the training data. These coefficients can be extracted in R using:

coefficients(optimal_model)
##                       (Intercept) poly(x, min_test_df, raw = TRUE)1 
##                      6.319791e+00                     -1.991694e-01 
## poly(x, min_test_df, raw = TRUE)2 poly(x, min_test_df, raw = TRUE)3 
##                      6.765512e-03                     -4.634246e-05

From these results, the coefficients are: \(\hat{\beta_0}=6.319791\), \(\hat{\beta_1}=-0.1991694\), \(\hat{\beta_2}=0.006765512\), and \(\hat{\beta_3}=-0.00004634246\). Thus, the estimated regression function is \[\hat{f}(x)=6.319791-0.1991694x+0.006765512x^{2}-0.00004634246x^{3}\]

The true function is given by \[f(x)=-4.5\bigg(\dfrac{x-10}{50}\bigg)^{3}+9\bigg(\dfrac{x-10}{50}\bigg)^{2}+4.2,\] which expands to \[f(x)=-0.000036x^{3}+0.00468x^{2}-0.0828x+4.596.\] The estimated regression function with the lowest MSEP, \(\hat{f}(x)\), is a cubic polynomial, matching the form of the true function \(f(x)\); this can be verified by expanding the true polynomial and comparing its coefficients term by term with those of the estimated polynomial. The model has effectively captured the underlying relationship, as the estimated coefficients are close to the true ones. The small discrepancies in the intercept and linear term are most likely due to noise in the training data. Overall, the estimated function is a good approximation of the true function and generalizes well to the testing data.
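The coefficient comparison need not go through Wolfram Alpha: since \(f(x)\) is itself a cubic, fitting a degree-3 polynomial through any four noiseless points of it recovers the expanded coefficients exactly. A quick sketch (x0 and beta are illustrative names):

```r
# Sketch: recover the expanded coefficients of the true cubic by exact interpolation.
f_2_9 <- function(x) -4.5 * ((x - 10) / 50)^3 + 9 * ((x - 10) / 50)^2 + 4.2
x0 <- 0:3  # any four distinct points determine a cubic exactly
beta <- unname(coef(lm(f_2_9(x0) ~ poly(x0, 3, raw = TRUE))))
round(beta, 6)  # 4.596, -0.0828, 0.00468, -0.000036 -- matching the expansion above
```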

Question 4:

a)

# Bias-Variance Trade-off -----
f_2_9 <- function(x) {
  -4.5 * ((x - 10) / 50)^3 + 9 * ((x - 10) / 50)^2 + 4.2
}

n_train <- 45
n_test <- 15
var_epsilon <- 2

set.seed(1)
# This is our single one and only test set:
test_data <- 
  tibble(
    x = runif(n = n_test, min = 0, max = 100),
    true_f_x = f_2_9(x),
    y = true_f_x + rnorm(n_test, sd = sqrt(var_epsilon))
  )

df_seq <- seq(1, 10, 1)
# We'll repeat the simulation 100 times:
M <- 100

# Data frame for all results (we need to average over MSE / bias / var for each x over many datasets,
# so we need to store all the predictions as a first step, then compute point-wise MSE / bias / var 
# and average over those)

# Important difference between this and the previous simulation:
# Here, we need to compute the variance and bias for each point. So
# we can't just sample a training dataset, compute a statistic (MSE)
# save it, and repeat. We need to save ALL predictions from all training
# datasets, and post-hoc compute the variance for each point across all datasets.


predictions <- NULL
for (m in 1:M) {
  if (m%%5 == 0) print(m)
  # Sample a training dataset: (many times)
  train_data <- 
    tibble(
      x = runif(n = n_train, min = 0, max = 100),
      true_f_x = f_2_9(x),
      y = true_f_x + rnorm(n_train, sd = sqrt(var_epsilon)),
      type = "tr"
    )
  
  # For each dataset, compute avg. bias, avg. variance and test MSE for each level of flexibility:
  for (i in 1:length(df_seq)) {
    smsp_2_9 <- lm( y ~ poly(x,df_seq[i],raw=TRUE), data = train_data)
    
    pred_test <- 
      as_tibble(predict(smsp_2_9, newdata = test_data)) %>%
      rename(pred_f_x = value) %>%
      mutate(
        true_f_x = test_data$true_f_x,
        true_y = test_data$y,
        test_obs_idx = factor(row_number()),
        m = m,
        df = df_seq[i]
      )
    
    predictions <- bind_rows(predictions, pred_test)
  }
}
## [1] 5
## [1] 10
## [1] 15
## [1] 20
## [1] 25
## [1] 30
## [1] 35
## [1] 40
## [1] 45
## [1] 50
## [1] 55
## [1] 60
## [1] 65
## [1] 70
## [1] 75
## [1] 80
## [1] 85
## [1] 90
## [1] 95
## [1] 100
(predictions)
## # A tibble: 15,000 × 6
##    pred_f_x true_f_x true_y test_obs_idx     m    df
##       <dbl>    <dbl>  <dbl> <fct>        <int> <dbl>
##  1     5.66     5.02   5.01 1                1     1
##  2     6.41     6.14   9.54 2                1     1
##  3     7.82     8.44   9.52 3                1     1
##  4    10.2      8.71   7.58 4                1     1
##  5     5.22     4.53   2.91 5                1     1
##  6    10.1      8.83   8.42 6                1     1
##  7    10.4      8.19   7.77 7                1     1
##  8     8.43     9.17   8.59 8                1     1
##  9     8.21     8.95   9.30 9                1     1
## 10     4.23     4.25   2.99 10               1     1
## # ℹ 14,990 more rows
bias_var_MSE <-                              
  predictions %>%                            # What does this object contain?
  group_by(test_obs_idx, df) %>%             # What does this line do?
  summarize(
    pt_wise_bias_sq = (mean(pred_f_x - true_f_x)) ^ 2,
    pt_wise_var = mean( (pred_f_x - mean(pred_f_x)) ^ 2 ),
    pt_wise_MSEE = mean( (pred_f_x - true_f_x) ^ 2 ),
    pt_wise_MSEP = mean( (pred_f_x - true_y) ^ 2 )
  ) %>%
  # How do the above formulas (in conjunction with the group_by and summarize functions) 
  # relate to the mathematical formulas for Bias, Variance, MSEE and MSEP?
  group_by(df) %>%                           # What does this line do?
  summarize(
    bias_sq = mean(pt_wise_bias_sq),
    var = mean(pt_wise_var),
    MSEE = mean(pt_wise_MSEE),
    MSEP = mean(pt_wise_MSEP)
  )
## `summarise()` has grouped output by 'test_obs_idx'. You can override using the
## `.groups` argument.
# What is the resulting object bias_var_MSE? What does it contain (rows, columns, values)?


# If we did this right, the summary statistic MSEP should equal the one we computed for Figure 2.9:
#all.equal(
#  MSEP_summary %>% filter(Subset == "test") %>% pull(mean_MSEP), 
 # bias_var_MSE$MSEP
#)


#all.equal(bias_var_MSE$MSEE, "")
# Always true

#est_var_epsilon <- ""
est_var_epsilon <- mean(bias_var_MSE$MSEP - bias_var_MSE$MSEE)

#bias_var_MSE$MSEP - (bias_var_MSE$bias_sq + bias_var_MSE$var + est_var_epsilon)
# True only in the limit
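The commented-out checks above rest on the identity \(E[(\hat{f}(x)-y)^2] = E[(\hat{f}(x)-f(x))^2] + \sigma^2_\epsilon\) for \(y = f(x) + \epsilon\) with noise independent of the fit, which is why averaging MSEP \(-\) MSEE over the degrees of freedom estimates \(\sigma^2_\epsilon\). A minimal numeric sketch of this identity at a single point, with a made-up prediction p:

```r
# Sketch: for a fixed prediction p and y = f + eps, MSEP is about MSEE + var(eps).
set.seed(5)
f_x <- 4.2; p <- 5; var_eps <- 2           # hypothetical values
y <- f_x + rnorm(1e5, sd = sqrt(var_eps))  # many test draws at the same x
msep <- mean((p - y)^2)
msee <- (p - f_x)^2
msep - msee  # close to var_eps = 2
```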

g <-
  bias_var_MSE %>%
  pivot_longer(c(bias_sq, var, MSEE, MSEP)) %>% 
  ggplot() +
  geom_line(aes(x = df, y = value, color = name)) +
  geom_hline(yintercept = est_var_epsilon, lty = 2) +
  theme_bw() +
  xlab("Flexibility") +
  ylab("Mean Squared Error") +
  theme(legend.title = element_blank())

ggplotly(g) 

b)

  1. The predictions object is a tibble containing one record per test observation, per simulation \(m\), and per polynomial degree, with the following columns:

  - pred_f_x: the model's prediction for each test observation, from the polynomial fitted to the training data.
  - true_f_x: the true value of \(f_{2.9}(x)\) for each test observation.
  - true_y: the simulated response \(y\) for each test observation.
  - test_obs_idx: a unique identifier (1 through 15) for the test observation; for each \(m\) and each \(df\), there is one prediction per test observation.
  - m: the simulation number, out of the \(M\) simulations run for each polynomial degree.
  - df: the degree of the polynomial used in the fit.

  2. group_by(test_obs_idx, df) takes the predictions tibble and splits it into groups, one for each combination of test observation index and polynomial degree, so that the summary statistics that follow are computed within each group. The grouping is done first by test_obs_idx and then by df.

  3. Inside summarize(), the point-wise summary statistics are defined. For each of the \(15\) test observations and each degree of freedom, all \(M\) saved predictions are used to compute the squared bias and variance at that point, as well as the point-wise MSEE and MSEP, averaging over the \(M\) simulated training sets. With \(M\) simulations per polynomial degree and \(15\) test observations, the variables defined in summarize() correspond to: \(pt\_wise\_bias\_sq = \Big[\dfrac{1}{M}\sum_{m=1}^{M}\big(\hat{f}_m(x)-f(x)\big)\Big]^2\), the squared bias at a test point for a given \(df\) across all \(m\); \(pt\_wise\_var = \dfrac{1}{M}\sum_{m=1}^{M}\big(\hat{f}_m(x)-\bar{\hat{f}}(x)\big)^2\), where \(\bar{\hat{f}}(x)\) is the mean of the \(M\) predictions at that point; \(pt\_wise\_MSEE = \dfrac{1}{M}\sum_{m=1}^{M}\big(\hat{f}_m(x)-f(x)\big)^2\); and \(pt\_wise\_MSEP = \dfrac{1}{M}\sum_{m=1}^{M}\big(\hat{f}_m(x)-y\big)^2\).

  4. The result is then grouped again, this time by the degrees of freedom (the degree of the polynomial), and summarized once more. Each degree of freedom now has \(15\) point-wise estimates, one per test observation; for each degree, these are averaged to give a single value each for the squared bias, variance, MSEE, and MSEP.

  5. The resulting object bias_var_MSE is a tibble with \(10\) rows and \(5\) columns. Each row corresponds to one polynomial degree (the first column, df); the remaining columns hold the averaged squared bias, variance, MSEE, and MSEP for that degree.
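The point-wise quantities computed in the first summarize() obey the exact sample identity pt_wise_MSEE = pt_wise_bias_sq + pt_wise_var (the bias-variance decomposition, before adding the irreducible error). A minimal numeric check with made-up predictions at a single test point:

```r
# Sketch: MSEE at a point decomposes exactly into squared bias plus variance.
set.seed(3)
pred <- rnorm(1000, mean = 5, sd = 2)  # hypothetical predictions from M training sets
f_x <- 4                               # hypothetical true f(x) at this point
msee    <- mean((pred - f_x)^2)
bias_sq <- mean(pred - f_x)^2
v       <- mean((pred - mean(pred))^2)
all.equal(msee, bias_sq + v)
## [1] TRUE
```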